Regression Analysis with Python Theory 


Simple Linear Regression 
• This type of regression allows us to study the relationship between two variables:
the independent variable (x) and the dependent variable (y).
• This type of regression is characterized by the formula Y = β0 + β1X + ε and is graphed
as a straight line, where:
o X - the value of the independent variable.
o Y - the value of the dependent variable.
o β0 - the intercept, a constant.
o β1 - the regression coefficient (slope). This tells us how much Y changes for each
unit change in X.
o ε - the error term. This is the difference between the actual y and the
predicted y, which is denoted ŷ. The error is evaluated for each observation;
these errors are also called residuals.

• A regression line can show a positive linear relationship, a negative linear relationship, or
no relationship.
• Simple linear regression is appropriate for modeling linear trends where the data is
evenly spread around the line; otherwise, other modeling techniques need to be considered.
• Examples:
o Relationship between sales and advertising costs.
• Limitations

o A fitted line shows association, not causation. To understand whether one variable
causes another, you will need additional research and statistical analysis.

o This type of regression is sensitive to outliers, which can lead to inaccurate
predictions by the model.
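
The formula Y = β0 + β1X + ε can be made concrete with a small sketch. The code below estimates the intercept and slope by ordinary least squares using only the Python standard library, then computes the residuals. The function name and the advertising/sales numbers are made up for illustration; they are not taken from a real dataset.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data: advertising cost (x) and sales (y).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.0]

b0, b1 = fit_simple_linear_regression(advertising, sales)
predicted = [b0 + b1 * x for x in advertising]
# Residual = actual y minus predicted y, one per observation.
residuals = [y - y_hat for y, y_hat in zip(sales, predicted)]

print(f"intercept={b0:.3f}, slope={b1:.3f}")
print("residuals:", [round(e, 3) for e in residuals])
```

With an intercept in the model, the least-squares residuals sum to (approximately) zero, which is a quick sanity check on the fit.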


Multiple Linear Regression 
• This type of regression is used to predict the value of a variable based on the values of
two or more other variables.
• This type of regression can also be used to identify the strength of the effect that the
independent variables have on a dependent variable.
• The multiple linear regression equation is as follows: Y = β0 + β1X1 + β2X2 + ... + βpXp + ε,
where:
o Y is the dependent variable.
o X1 through Xp are the distinct independent variables.
o β0 is the value of Y when all independent variables (X1 through Xp) are equal to
zero, and β1 through βp are the estimated regression coefficients.
o ε is the error term. This is the difference between the actual y and the
predicted y, which is denoted ŷ. The error is evaluated for each observation;
these errors are also called residuals.
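
As a rough illustration of how the coefficients β0 through βp can be estimated, the sketch below solves the least-squares normal equations with plain Python. The helper names and the data are hypothetical; the data is generated from y = 1 + 2*x1 + 3*x2 with no noise so that the recovered coefficients are easy to check. In practice a numerical library would normally do this work.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple_linear_regression(rows, ys):
    """Return [b0, b1, ..., bp] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise).
features = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
targets = [1 + 2 * x1 + 3 * x2 for x1, x2 in features]

coefficients = fit_multiple_linear_regression(features, targets)
print([round(b, 6) for b in coefficients])
```

The normal equations become unstable when the independent variables are strongly correlated with one another, which is one reason multicollinearity matters for this model.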

• Assumptions (these are the same as for simple linear regression, apart from
multicollinearity, which needs at least two independent variables)

o Linear relationship: A linear relationship is assumed between the dependent
variable and the independent variables.
o Normality: The values of the residuals are normally distributed.
o No multicollinearity: The independent variables are not highly correlated with one another.
o Homoscedasticity: The size of the error in our prediction doesn’t change
significantly across the values of the independent variables.
o No outliers: There are no influential cases biasing your model.
o Independence: The values of the residuals are independent, such that one observation of the
error term should not allow us to predict the next observation.

• Limitations

o This type of regression is prone to multicollinearity. This is the case where there
are independent variables that are highly correlated with one another.
It is a problem because this type of regression assumes there is no strong relationship among
the independent variables.

o It is prone to noise and overfitting: if the number of observations is smaller
than the number of features, the model may start fitting the noise in the
training data rather than the underlying relationship.
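
One simple way to look for multicollinearity is to compute the pairwise correlation between the independent variables and flag pairs that are strongly correlated. The sketch below does this with the standard library. The column names, the data, and the 0.9 threshold are all assumptions chosen for illustration, not a recommended rule.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for predictor pairs with |r| above threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                flagged.append((a, b, r))
    return flagged


# Hypothetical predictors: the two price columns carry nearly the same information.
predictors = {
    "price_usd": [10.0, 12.0, 15.0, 20.0, 22.0],
    "price_eur": [9.1, 11.0, 13.6, 18.3, 20.1],
    "store_age": [3.0, 9.0, 1.0, 7.0, 4.0],
}

flagged = correlated_pairs(predictors)
for a, b, r in flagged:
    print(f"{a} and {b} are strongly correlated (r={r:.3f})")
```

When a pair is flagged, a common response is to drop or combine one of the two variables before fitting the model.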

• Improving your model

o Need more data: A larger dataset generally gives more reliable predictions.

o Bad assumptions: We made the assumption that this data has a linear
relationship, but that might not be the case. Visualizing the data may help you
determine that.

o Poor features: The features we used may not have had a high enough correlation
to the values we were trying to predict.

• Applications

o Sales forecasting 

o Satisfaction Analysis 

o Price Estimation 

o Employment income 


K-Nearest Neighbors (KNN) 

• K-Nearest Neighbors (KNN) is a machine learning algorithm that can be used for both
regression and classification. Our session will focus on using KNN for regression.

• KNN is a non-parametric algorithm (which means that it does not make any assumptions
about the underlying data distribution).

• It is also an instance-based algorithm, which means it compares new problem instances
with instances seen in training, which have been stored in memory. As such, it looks at
the nearest neighbors to decide what the value of any new point should be.


In regression, it works by storing all available cases and using a distance function to
find the K nearest neighbors of a new point. The prediction is the average of the
numerical targets of those K neighbors. The distance function can be Euclidean, Manhattan, or
Minkowski.
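
The procedure described above can be sketched in a few lines of plain Python. The Minkowski distance covers the other two cases: p=1 gives the Manhattan distance and p=2 gives the Euclidean distance. The function names and the training data are hypothetical.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_x, train_y, query, k=3, p=2):
    """Predict by averaging the targets of the k nearest training points."""
    distances = sorted(
        (minkowski_distance(x, query, p), y) for x, y in zip(train_x, train_y)
    )
    nearest_targets = [y for _, y in distances[:k]]
    return sum(nearest_targets) / k


# Hypothetical training data: one feature and one numeric target per point.
train_x = [(1.0,), (2.0,), (3.0,), (6.0,), (7.0,), (8.0,)]
train_y = [10.0, 12.0, 14.0, 30.0, 32.0, 34.0]

prediction = knn_predict(train_x, train_y, query=(2.5,), k=3)
print(prediction)
```

Here the three nearest neighbors of 2.5 are the points at 1.0, 2.0 and 3.0, so the prediction is the average of their targets.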

While performing regression, we have to choose the parameter K. How do we
choose the right value of K?

We simply run the KNN algorithm several times with different values of K and choose the
K that gives the lowest prediction error while maintaining the algorithm’s
ability to make accurate predictions when it’s given data it hasn’t seen before.

o As we decrease the value of K to 1, our predictions become less stable.

o As we increase the value of K, our predictions become more stable due to
averaging over more neighbors, and thus more likely to be accurate (up
to a certain point). Beyond that point, the error starts to increase again because
the average includes neighbors that are too far away.
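
Trying several values of K can be sketched as follows. This version uses leave-one-out validation (each point is predicted from all the others) as one possible way to measure error on data the model has not seen; that choice, the function names, and the data are assumptions made for illustration.

```python
def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to query (1-D feature)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    return sum(y for _, y in nearest[:k]) / k


def leave_one_out_error(xs, ys, k):
    """Mean squared error when each point is predicted from all the others."""
    total = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        total += (knn_predict(rest_x, rest_y, xs[i], k) - ys[i]) ** 2
    return total / len(xs)


# Hypothetical noisy data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.3, 3.6, 6.4, 7.7, 10.2, 11.9, 14.3, 15.8]

errors = {k: leave_one_out_error(xs, ys, k) for k in range(1, 8)}
best_k = min(errors, key=errors.get)

for k, err in errors.items():
    print(f"K={k}: error={err:.3f}")
print("best K:", best_k)
```

On this made-up data the error is larger at both K=1 and K=7 than at K=2, which matches the pattern described above: too few neighbors gives unstable predictions, and too many averages over points that are too far away.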

It's commonly used for its ease of interpretation and low training time.
Benefits

o It’s easy to implement as there are only two parameters to choose, i.e. the value of K
and the distance function (e.g. Euclidean or Manhattan).

o No training period, as it simply stores the training dataset and uses it only when
making predictions.

Limitations 

o If the data has many dimensions, KNN becomes time-consuming. Hence, applying
KNN to high-dimensional data is not advisable. High-dimensional here means
that the number of variables/features is large compared to the number of
observations.

o Feature scaling needs to be done before applying the algorithm; otherwise, features with
large ranges dominate the distance and the predictions can be wrong.

o Sensitive to noisy data, missing values, and outliers. 
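
To illustrate the feature scaling mentioned above, the sketch below standardizes each feature to mean 0 and standard deviation 1, which is one common scaling method among several. The feature names and values are hypothetical.

```python
from math import sqrt


def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]


# Hypothetical features measured on very different scales.
income = [30000.0, 45000.0, 60000.0, 75000.0]
age = [25.0, 35.0, 45.0, 55.0]

income_scaled = standardize(income)
age_scaled = standardize(age)

print([round(v, 3) for v in income_scaled])
print([round(v, 3) for v in age_scaled])
```

Before scaling, a difference of a few thousand in income would swamp any difference in age when distances are computed. After scaling, both features contribute on a comparable scale.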


Decision Trees Regression 


This is a type of supervised learning algorithm that can be used for classification and 
regression problems. Our focus will be its application to regression problems. 
This type of regression uses decision trees to split the data multiple times according to 
certain cutoff values in the features. Through splitting, different subsets of the dataset 
are created, with each instance belonging to one subset. The final subsets are called 
terminal or leaf nodes and the intermediate subsets are called internal nodes or split 
nodes. To predict the outcome in each leaf node, the average outcome of the training 
data in this node is used. 
There are several algorithms that can be used to build decision trees, such as CART, ID3,
and C4.5.
We use decision trees where there are non-linear or complex relationships between 
features and labels. 
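
The splitting idea described above can be sketched for a single split on one feature. The code tries every cutoff between neighboring feature values, keeps the one with the lowest total squared error, and predicts the average outcome of the training data in each leaf. Minimizing squared error is the criterion assumed here (it is the CART-style choice for regression); the function names and data are hypothetical, and a real tree would repeat this step recursively inside each subset.

```python
def mean(values):
    return sum(values) / len(values)


def sum_squared_error(values):
    """Squared error when every value is predicted by the group mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (cutoff, left_mean, right_mean) minimizing total squared error.

    Assumes xs contains at least two distinct values.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cutoff can separate equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sum_squared_error(left) + sum_squared_error(right)
        if best is None or error < best[0]:
            cutoff = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (error, cutoff, mean(left), mean(right))
    return best[1:]


def predict(split, x):
    """Route x to the left or right leaf and return that leaf's average."""
    cutoff, left_mean, right_mean = split
    return left_mean if x <= cutoff else right_mean


# Hypothetical data with a jump between x = 3 and x = 6.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

split = best_split(xs, ys)
print(split)
print(predict(split, 2.5), predict(split, 7.5))
```

The chosen cutoff falls in the gap between the two groups, and each leaf predicts the mean of its own group, which is why a tree can follow jumps that a straight line cannot.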
Benefits 

o It requires little data preparation. 


o It performs well with large datasets.
o It requires relatively little effort to train.
Limitations 

o Decision trees have a high probability of overfitting.

o A single tree often gives lower prediction accuracy than other machine
learning algorithms.

o A small change in the training data can result in a large change in the tree and 
consequently the final predictions. 


Support Vector Machine - Regression 


This is a regression algorithm that can be used to work with continuous values. 

It can be used to avoid the difficulties of using linear functions in a high-dimensional
feature space.

SVR technique relies on kernel functions to construct the model. The commonly used 
kernel functions are: 


o Linear 

o Polynomial 
o RBF 

o Sigmoid 
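
For reference, the four kernels listed above can be written as small functions in their commonly used forms. Each one takes two points and returns a similarity score. The parameter names (gamma, coef0, degree) and their default values are assumptions chosen for illustration; suitable values depend on the data.

```python
from math import exp, tanh


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def linear_kernel(a, b):
    return dot(a, b)


def polynomial_kernel(a, b, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(a, b) + coef0) ** degree


def rbf_kernel(a, b, gamma=0.5):
    # Depends only on the squared distance between the two points.
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return exp(-gamma * squared_distance)


def sigmoid_kernel(a, b, gamma=0.5, coef0=0.0):
    return tanh(gamma * dot(a, b) + coef0)


u, v = (1.0, 2.0), (2.0, 0.0)
print("linear:    ", linear_kernel(u, v))
print("polynomial:", polynomial_kernel(u, v))
print("rbf:       ", rbf_kernel(u, v))
print("sigmoid:   ", sigmoid_kernel(u, v))
```

The RBF kernel returns 1 for identical points and decays toward 0 as the points move apart, which is what lets it model non-linear relationships.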


While implementing the SVR technique, the user needs to select the appropriate kernel
function.
The selection of a kernel function is tricky and requires optimization techniques for the
best selection. To fit the model, we define how much error is acceptable
and find an appropriate line (or hyperplane in higher dimensions) to fit the data within that margin.
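
The idea of an acceptable error can be sketched with the epsilon-insensitive loss that SVR is usually described with: errors smaller than a chosen tolerance cost nothing, and larger errors are penalized only by the amount they exceed it. The tolerance value of 0.5, the example line, and the observations below are hypothetical.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon=0.5):
    """Zero inside the tolerance band, growing linearly outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)


def predict(x):
    """Hypothetical fitted line y = 2x + 1."""
    return 2.0 * x + 1.0


# Hypothetical observations as (x, actual y) pairs.
observations = [(1.0, 3.2), (2.0, 5.0), (3.0, 8.5), (4.0, 7.0)]

losses = [epsilon_insensitive_loss(y, predict(x)) for x, y in observations]
print(losses)
```

The first two observations lie within the tolerance band around the line and contribute no loss, while the last two lie outside it and are penalized by the excess only.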
Advantages 

o SVR is robust to outliers.

o SVR requires less computation compared to some other regression techniques.

o Its implementation is easy. 
Disadvantages

o Choosing an appropriate kernel function is difficult and complex.

o The complexity and memory requirements of SVR are very high.

o Feature scaling must be done before applying SVR.

o SVR takes a long time to train on large datasets.